Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AN-380 Make call cache hashing strategy configurable per filesystem and backend #7683

Open
wants to merge 33 commits into
base: develop
Choose a base branch
from

Conversation

jgainerdewar
Copy link
Collaborator

@jgainerdewar jgainerdewar commented Jan 24, 2025

Description

Code is ready for review, still todo:

  • More testing in a BEE

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@jgainerdewar jgainerdewar marked this pull request as ready for review February 4, 2025 21:48
@jgainerdewar jgainerdewar requested a review from a team as a code owner February 4, 2025 21:48
Copy link
Collaborator

@aednichols aednichols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches the mental picture I developed during mob times. Nice work!

Comment on lines +532 to +539
###### Notes on GCS Hashing Strategy

For some high-throughput production use cases that run many, many copies of the same task differing by only one input file,
the collision rate of `crc32c` may be unacceptably high. To dramatically reduce the chance of collision at the cost of
reducing the collection of tasks that can be call cached, we recommend `hashing-strategy: ["md5", "identity"]`. This
will use `md5` hashes when they exist, and fall back to the very strict `identity` strategy when they do not. Because
all GCS files created by Cromwell are guaranteed to have `md5`, `identity` comes into play only for user-provided workflow
input files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good explanation, one thing to add could be "why?" md5 does not exist, namely parallel composite upload

@@ -174,23 +176,39 @@ object GcsBatchSizeCommand {
file.objectBlobId.map(GcsBatchSizeCommand(file, _))
}

case class GcsBatchCrc32Command(override val file: GcsPath, override val blob: BlobId, setUserProject: Boolean = false)
extends IoHashCommand(file)
case class GcsBatchHashCommand(override val file: GcsPath,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extremely optional bikeshedding.

I've never been the biggest fan of this name, since this command is for a single file... it does not contain a batch.

Doubly so now that GCP Batch competes with GCS batch for headspace.

Suggested change
case class GcsBatchHashCommand(override val file: GcsPath,
case class GcsHashCommand(override val file: GcsPath,

private def getFileHashForGcsPath(gcsPath: GcsPath): IO[FileHash] = delayedIoFromTry {
gcsPath.objectBlobId.map(id => FileHash(HashType.GcsCrc32c, gcsPath.cloudStorage.get(id).getCrc32c))
}
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also extremely optional, I wonder if we could apply ✨ More Types ✨ to DRY out this group of functions.

Suggested change
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] =
private def getFileHashForPath[T](path: T, hashStrategy: FileHashStrategy[T]): IO[Option[FileHash]] =

fsConfigs <- configurationDescriptor.backendConfig.as[Option[Config]]("filesystems").toList
fsKey <- fsConfigs.root.keySet().asScala
configKey = s"${fsKey}.caching.hashing-strategy"
// TODO this allows users to override with an empty list to prevent caching, is that desirable or a footgun?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a global call-caching.enabled flag that seems like a better choice.

I don't it makes sense to let users toggle call caching on a per-filesystem basis because one backend can use multiple filesystems in the same task (e.g. GCS + DRS).

I would make empty list an exception if call-caching.enabled is true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants